[Prototype] Add data cleaning in fast-llm prepare, concept #210
Demonstration and Discussion of Concept
This code serves as a demonstration and discussion of the proposed concept.
Key Decisions:
Usage:
To prepare a dataset, simply call:
    dataset = self._config.processors.apply(dataset)
The config would be something like this:
    processors:
      steps:
        - type: length_filter
          field: text
          min_length_chars: 100
          max_length_chars: 100000
        - ...
@jlamypoirier, @tscholak What do you think?
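For discussion, the usage above could be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: `LengthFilterStep`, `Processors`, `STEP_TYPES`, and `from_dict` are hypothetical names, the dataset is assumed to be a plain list of dicts, and the length bounds are assumed to be inclusive.

```python
from dataclasses import dataclass


@dataclass
class LengthFilterStep:
    # Hypothetical step mirroring the `length_filter` entry in the config above.
    field: str = "text"
    min_length_chars: int = 0
    max_length_chars: int = 2**31

    def keep(self, record: dict) -> bool:
        # Assumption: both bounds are inclusive.
        n_chars = len(record[self.field])
        return self.min_length_chars <= n_chars <= self.max_length_chars


# Hypothetical registry mapping the config `type` key to a step class.
STEP_TYPES = {"length_filter": LengthFilterStep}


@dataclass
class Processors:
    steps: list

    @classmethod
    def from_dict(cls, config: dict) -> "Processors":
        steps = []
        for step_config in config["steps"]:
            params = dict(step_config)  # copy so the caller's config is not mutated
            step_type = params.pop("type")
            steps.append(STEP_TYPES[step_type](**params))
        return cls(steps)

    def apply(self, dataset: list) -> list:
        # Run the steps in the order they are listed in the config.
        for step in self.steps:
            dataset = [record for record in dataset if step.keep(record)]
        return dataset
```

With a config of `{"steps": [{"type": "length_filter", "field": "text", "min_length_chars": 3, "max_length_chars": 5}]}`, calling `Processors.from_dict(config).apply(rows)` would keep only the rows whose `text` is 3 to 5 characters long.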
Hi @bigximik, thanks for putting this together. I appreciate the careful thinking you've put in here! However, let's simplify significantly. The goal isn't to design a general, modular pipeline system. It's just about adding these very specific cleaning filters. We already know exactly what filters we want and in what order. Here's what I'd suggest:
We can always refactor if more complexity is actually required down the line, but let's get this feature shipped quickly and cleanly first. Can you please move forward by just implementing the concrete filters directly?
I tend to mostly agree with @tscholak on this one. I think it's a good idea to make processors into modular Config/Configurable pairs, since that is relatively simple and non-controversial, but anything more than that requires a bit more thinking and is probably premature at this stage.
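To make the Config/Configurable idea concrete, here is a rough sketch of what one such pair might look like. It uses plain dataclasses rather than Fast-LLM's own base classes, and `LengthFilterConfig`, `LengthFilter`, and `build` are hypothetical names chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LengthFilterConfig:
    # Hypothetical config half of the pair: holds parameters only, no logic.
    field: str = "text"
    min_length_chars: int = 0
    max_length_chars: int = 1_000_000

    def build(self) -> "LengthFilter":
        # The config knows which processor class it configures.
        return LengthFilter(self)


class LengthFilter:
    # Hypothetical processor half of the pair: holds logic, reads its config.
    def __init__(self, config: LengthFilterConfig):
        self._config = config

    def apply(self, dataset: list) -> list:
        config = self._config
        # Assumption: rows are dicts and both length bounds are inclusive.
        return [
            row
            for row in dataset
            if config.min_length_chars <= len(row[config.field]) <= config.max_length_chars
        ]
```

Each filter would then be a small, independently testable unit, for example `LengthFilterConfig(min_length_chars=100).build().apply(rows)`, without committing to a general pipeline abstraction.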
        
          
fast_llm/data/preparator/hf_processors/implementations/agregator.py (outdated, resolved)
        
      …ept for clamav, integration not tested
Created basic implementation based on feedback.
Next Steps
✨ Description
part of #112
Closes #
🔍 Type of change
Select all that apply:
📝 Changes
List the key changes introduced in this PR:
✅ Checklist
Make sure the following tasks are completed before submitting the PR:
General
Dependencies and Configuration
Testing
Performance Impact
📊 Performance Impact Details
If there is any impact on performance, describe it and provide benchmark results, if applicable:
🗒️ Additional Notes
Include any additional context, information, or considerations here, such as known issues, follow-up tasks, or backward compatibility concerns.